{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install the numpy library if needed\n", "!pip install --user numpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install the scipy library if needed\n", "!pip install --user scipy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "import scipy.stats as stats\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 4 - Probability distributions: Normal and Exponential\n", "\n", "Recall that a *probability distribution* is a mathematical function that describes the probabilities of different outcomes. Sometimes we want to find the probability distribution that best describes the distribution of our data (the distribution of the data itself is called the *empirical distribution*).\n", "\n", "## Normal or Gaussian distribution\n", "\n", "In MAT 128 we looked at the normal (or Gaussian) distribution, which has the equation: $${\\displaystyle f(x\\mid \\mu ,\\sigma ^{2})={\\frac {1}{\\sqrt {2\\pi \\sigma ^{2}}}}e^{-{\\frac {(x-\\mu )^{2}}{2\\sigma ^{2}}}}}$$\n", "\n", "This distribution is a *parametric distribution* because the function depends on two parameters: the mean $\\mu$ (pronounced \"mu\"), and the standard deviation $\\sigma$ (pronounced \"sigma\").\n", "\n", "Let's plot the normal distribution. First define variables for mu and sigma so we can easily change them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mu = 0\n", "sigma = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To plot graphs in Python, we actually plot a bunch of points very close together, and thus need to generate the x and y values for these points. First we generate 100 evenly spaced x values between mu - 3*sigma = -3 and mu + 3*sigma = 3." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each x value, we can compute the value of the normal distribution (the y value):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = stats.norm.pdf(x, mu, sigma)\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: pdf stands for *probability density function*, which is what a continuous probability distribution is called.\n", "\n", "Now make a line plot of these x and y values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the x values are so close together, we can't tell that the curve is made up of a bunch of straight lines.\n", "\n", "In MAT 128 we sampled values from the normal distribution using the numpy library. We can sample 1000 values from this normal distribution the same way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = np.random.normal(loc = mu, scale = sigma, size = 1000)\n", "sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the sample. Remember you will first have to make it into a Pandas Series: `pd.Series(sample)` \n", "\n", "How does the histogram compare to the normal probability distribution?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To really compare the probability distribution with the histogram, we can plot the two in the same plot, using the `density=True` option to change the y axis of the histogram to the same scale as the probability distribution." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normally distributed data\n", "\n", "Let's look at some normally distributed data. Download the text file `babyboom.dat.txt`. It contains data about 44 babies that were born on December 18, 1997 at the Mater Mothers' Hospital in Brisbane, Australia. \n", "\n", "Data Columns:\n", " - Time of birth recorded on the 24-hour clock\n", " - Sex of the child (1 = girl, 2 = boy)\n", " - Birth weight in grams\n", " - Number of minutes after midnight of each birth\n", "\n", "Data set from: https://raw.githubusercontent.com/cs109/2015lab3/master/babyboom.dat.txt\n", "\n", "Original reference:\n", "Steele, S. (December 21, 1997), \"Babies by the Dozen for Christmas: 24-Hour Baby Boom,\" The Sunday Mail (Brisbane), p. 7\n", "\n", "Open the dataset in a text editor. How are the columns separated? Are there any column names?\n", "\n", "Despite these issues we can still use `pd.read_csv()`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "babies = pd.read_csv(\"../data/babyboom.dat.txt\", header=None, sep=r'\\s+', names=['24hrtime','sex','weight','minutes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter `header=None` tells `read_csv()` that there is no header row with column names, and instead we specify the column names with the parameter `names=['24hrtime','sex','weight','minutes']`. \n", "\n", "The parameter `sep=r'\\s+'` tells `read_csv()` that the columns are separated by white space (space, tab, etc.) instead of a comma. The `r` prefix makes it a raw string, so Python passes the backslash through to the regular expression unchanged." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "babies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which column might be normally distributed?\n", "\n", "Let's plot a histogram of the weights:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compare to a normal distribution, we should compute the mean and standard deviation of the weights:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you plot the normal distribution with this mean and standard deviation on top of your histogram? Remember to plot the histogram with `density=True` so both share the same y-axis scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think the weights are normally distributed?\n", "\n", "## Exponential distribution\n", "\n", "Let's look at one more distribution. The exponential distribution is used to model the time between events that happen independently. Here, we will use it to model the time between the babies' births.\n", "\n", "The function for the exponential distribution is: $$\n", "f(x;\\lambda) = \\begin{cases}\n", "\\lambda e^{-\\lambda x} \\quad \\text{if } x \\ge 0, \\\\\n", "0 \\quad \\quad \\text{if } x < 0.\n", "\\end{cases}\n", "$$\n", "\n", "The exponential distribution is also a parametric distribution. How many parameters does it have and what are they?\n", "\n", "$\\lambda$ (\"lambda\") is called the *rate parameter*, and is estimated as $\\frac{1}{\\text{mean of data}}$. \n", "\n", "\n", "Let's plot the exponential distribution. First create a set of `x` values between -2 and 5." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-2, 5, 100)\n", "x\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next calculate the `y` values when lambda is 1. Note that since `lambda` is a reserved word in Python, we can't use it as our variable name and will use `lambda_` instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lambda_ = 1\n", "y = stats.expon.pdf(x, scale = 1/lambda_)\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the x and y values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling from the Exponential Distribution\n", "\n", "Sample 1000 values from the exponential distribution with $\\lambda = 1$, using the command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = np.random.exponential(scale = 1/lambda_, size = 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a histogram of the sample to see its distribution: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the histogram compare to the theoretical distribution (the probability density function or pdf)?\n", "\n", "### Exponentially distributed data\n", "\n", "The time between births might be exponentially distributed, but we just have the number of minutes after midnight each baby was born. So we need to calculate the difference between the birth times. Luckily this type of calculation is done frequently in data science and there is already a function for it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mins_btw_births = babies[\"minutes\"].diff()\n", "mins_btw_births" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the first value is `NaN`, because the first birth has no earlier birth to compare with.\n", "\n", "Plot a histogram of the number of minutes between births." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think the minutes between births have an exponential distribution? Let's plot the pdf of the exponential distribution on top of the histogram to compare. \n", "\n", "First, estimate $\\lambda$ by computing $\\frac{1}{\\text{mean time between births}}$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next plot the pdf of the exponential distribution with that $\\lambda$ on top of your histogram (using the `density = True` parameter):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do you think the time between baby births follows an exponential distribution?\n", "\n", "The `scale` parameter is $1/\\lambda$, but since $\\lambda = \\frac{1}{\\text{mean time between births}}$ we could have just used the mean time between births as the scale." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenges:\n", "- What happens if you plot the normal distribution using only 10 evenly spaced x values?\n", "- You can also sample from the normal distribution using the scipy library: `sample = stats.norm.rvs(loc=mu, scale=sigma, size=1000)`. 
Try it.\n", "- Consider the green taxi trip dataset: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }